Group members: Haoran Li, Haozheng Ni, Mingyang Ni, Chuqi Yang

Keywords: scatterplots, Shiny interactive plots, parallel coordinates plot, divergent bar plot, word cloud.




1 Introduction




The National Basketball Association (NBA) is a men’s professional basketball league in North America. It is widely considered one of the most popular and successful sports leagues in the world. A huge market has grown around it, supported by mature methodologies for analyzing the performance and value of players and teams. However, we believe that these proven approaches are not the only way to carry out meaningful analysis of the players and the teams. In our report, we look at several seemingly random aspects and show that they are intrinsically related to the performance of the teams and the players. Integrating this new approach with the well-proven ones can bring valuable insights into this industry.

Our team comprises four members: Haoran Li, Haozheng Ni, Mingyang Ni, and Chuqi Yang. Haoran Li was in charge of applying classroom methods to analyze the relationships between various factors. Haozheng Ni took responsibility for creating an interactive plot in Shiny. Mingyang Ni linked the various parts together and produced the final report. Chuqi Yang carried out the initial data collection and data cleaning.




2 Description of Data




We have two main sources of data. The numerical data comes from http://stats.nba.com/, the official site for NBA statistics, which is very reliable. The quality of this data is very high, with no missing values or irregular entries. The text data comes from the Twitter API. This set of data is much more challenging than the numerical data, and we carried out extensive cleaning to bring it into a usable format. The NBA numerical data exhibits some very interesting features, such as a rounding pattern, which we illustrate in detail in the following parts.
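To give a flavor of the kind of cleaning the Twitter text required, here is a minimal sketch using the `tm` package loaded later in this report. The raw tweets shown are hypothetical placeholders, not actual records from our data set:

```r
library(tm)

# Hypothetical raw tweets standing in for records pulled from the Twitter API
raw <- c("Great win tonight!!! #NBA http://t.co/abc",
         "RT @fan: unbelievable clutch shot")

corpus <- VCorpus(VectorSource(raw))
corpus <- tm_map(corpus, content_transformer(tolower))
# Drop URLs before punctuation removal, otherwise fragments like "httptcoabc" survive
corpus <- tm_map(corpus, content_transformer(function(x) gsub("http\\S+", "", x)))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
```

The order of the steps matters: URL removal must precede punctuation removal, since stripping punctuation first would merge URL fragments into ordinary-looking tokens.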




3 Analysis of Data Quality




Since the numerical data is taken from the official site, we did not observe any missing values or irregular entries; the quality of the data is very high. However, during our analysis we discovered a rounding pattern. We also discovered a pattern in the Twitter data.


3.1 Rounding pattern on turnovers




library(devtools)
library(rgdal)
library(GGally)
library(ggplot2)
library(plotly)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(viridis)
library(grid)
library(gridExtra)
library(ggimage)
library(png)
library(gridGraphics)
library(dplyr)
library(tidyr)
library(forcats)
#devtools::install_github('bart6114/artyfarty')
library('artyfarty')
library(tm)
library(wordcloud)
location = "C:/Users/Mingyang/Desktop/NBA_data/"
clutch = read.csv(paste(location, 'fetched.csv', sep=""))
df1 = clutch[,c('PF','TOV','team')]
df1= gather(df1,type,count,-team)
df1$count <-  ifelse(df1$type =="PF",df1$count*(-1),df1$count)
temp = df1[df1$type=='TOV',]
new_levels=  as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
#df1 <- within(df1, team <- factor(team, levels=names(sort(count,  decreasing=TRUE))))
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
  geom_bar(stat="identity",position="identity")+
  scale_fill_manual(values = pal("five38"))+
  coord_flip()+ggtitle("Personal fouls (PF) and turnovers (TOV)")+
  geom_hline(yintercept=0)+
  ylab("counts")+
  xlab("team name")+
  scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
  theme_scientific()


From this plot, we can observe a very obvious rounding pattern for turnovers across teams: the turnover counts cluster in groups separated by a few fixed step sizes. This is very likely caused by rounding in the data.
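The clustering can also be checked numerically rather than visually: if the turnover counts are rounded, the gaps between consecutive distinct values collapse to a small set of step sizes. A short sketch, assuming the `clutch` data frame loaded above:

```r
# Distinct turnover values, sorted
tov <- sort(unique(clutch$TOV))

# Gaps between consecutive distinct values; rounding shows up as
# a small set of repeated step sizes rather than a continuum
table(round(diff(tov), 3))
```

A table with only a handful of distinct gap sizes supports the rounding hypothesis; unrounded continuous data would produce nearly as many distinct gaps as values.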




3.2 Rounding pattern on free throw




## Preprocess data to merge with the team 
df_name_team = read.csv(paste(location, 'Name_Team.csv', sep=""))
df_name_team = df_name_team[,c("PERSON_ID","Team_Name")]
colnames(df_name_team)[1] = "player_id"

df_name_team_abbr = read.csv(paste(location, 'abbr_team.csv', sep=""))

my_read = function(path,team=df_name_team){
  temp = read.csv(file=path)
  final = merge(temp,team,by = "player_id",all=TRUE)
  #final$Abbri = df_name_team_abbr
  return(final[ ,!(colnames(final) == "X")])
}


df_3pct = my_read(path = paste(location,'3pct_df.csv', sep=""))
df_3fgm = my_read(path = paste(location,'3fgm_df.csv', sep=""))

df_3 = merge(df_3fgm,df_3pct,by = "player_id",all=TRUE)

df_pct = my_read(path = paste(location,'pct_df.csv', sep=""))
df_fgm = my_read(path = paste(location,'fgm_df.csv', sep=""))

df_all = merge(df_fgm,df_pct,by = "player_id",all=TRUE)

df_pts = my_read(path = paste(location,'pts_df.csv', sep=""))

df_fta = my_read(path = paste(location,'fta_df.csv', sep=""))

df_fct = my_read(path = paste(location,'fct_df.csv', sep=""))

df_ftm = my_read(path = paste(location,'ftm_df.csv', sep=""))
df_fta['df_ftm_30sec_plusminus_5'] = df_ftm$X30sec_plusminus_5
df_fta_v1 =  df_fta
df_fta_v1_2 = df_fta_v1[!is.na(df_fta$player_name),]
p_fta_ftm = ggplot(df_fta_v1_2)+
  geom_point(aes(X30sec_plusminus_5,
                 df_ftm_30sec_plusminus_5,
                 color = player_name,
                 shape=Team_Name),
             size = 1.3,
             alpha=0.5)+
  labs(title = "df_ftm_30sec_plusminus_5 vs. X30sec_plusminus_5", x = 'X30sec_plusminus_5', y='df_ftm_30sec_plusminus_5')
ggplotly(p_fta_ftm)


From this plot, we can observe that free throws made and free throw attempts are rounded to 0.1.
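This rounding claim can be verified directly: if the values are rounded to 0.1, multiplying by 10 should leave (near-)integers. A minimal check, assuming `df_ftm` as loaded above:

```r
# Values rounded to 0.1 become integers after multiplying by 10;
# a small tolerance absorbs floating-point representation error
ft <- na.omit(df_ftm$X30sec_plusminus_5)
all(abs(ft * 10 - round(ft * 10)) < 1e-8)
```

The same check applied to the attempt column (`df_fta$X30sec_plusminus_5`) confirms the pattern for attempts as well.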




3.3 Missing value pattern for Twitter data




pro_76 = read.csv("C:/Users/Mingyang/Desktop/NBA_data/Twitter/By Team/preprocessed_76ers.csv",
                  colClasses=c("NULL", NA, NA))
pro_spurs =read.csv("C:/Users/Mingyang/Desktop/NBA_data/Twitter/By Team/preprocessed_Spurs.csv",
                  colClasses=c("NULL", NA, NA))
pro_warriours = read.csv("C:/Users/Mingyang/Desktop/NBA_data/Twitter/By Team/preprocessed_Warriors.csv",
                  colClasses=c("NULL", NA, NA))
pro_lakers = read.csv("C:/Users/Mingyang/Desktop/NBA_data/Twitter/By Team/preprocessed_Lakers.csv",
                  colClasses=c("NULL", NA, NA))
temp_1 = merge(pro_76, pro_spurs,by ='time',  all=TRUE)
names(temp_1) = c("time","76er", "spurs")
temp_1 = merge(temp_1, pro_warriours,by ='time',  all=TRUE)
names(temp_1) = c("time","76er", "spurs","pro_warriours")
temp_1 = merge(temp_1, pro_lakers,by ='time',  all=TRUE)
names(temp_1) = c("time","76er", "spurs","pro_warriours","pro_lakers")
temp_1[temp_1=="[]"]=NA
#mydf = temp_1[sample(nrow(temp_1), 1000), ]## random sample 1000 rows/records

my_missing = function(seg,title){
  tidydf <- seg %>% 
    gather(key, value, -time) %>%
    mutate(missing = ifelse(is.na(value), "yes", "no"))
  tidydf <- tidydf %>%
    mutate(missing2 = ifelse(missing == "yes", 1, 0))
  p = ggplot(tidydf, aes(x = fct_reorder(key, -missing2, sum), y = fct_reorder(time, -missing2, sum))) +
    geom_tile(color = "white",aes(fill = missing))+
    theme(axis.text.x=element_text(),
        axis.text.y=element_text(size=2,angle=90))+
    labs(title = title,x = 'Team', y='Time')+
    scale_fill_manual(values=c("slategray2", "tomato2"))
  return(p)
}
### data is too large; separate based on time to see the pattern:
####    02:00:00-5:00:00
p1 = my_missing(temp_1[1:213,],title = "Missing 02:00:00-5:00:00")
#### 15:30:00-16:00:00
p2 = my_missing(temp_1[214:1002,],title = "Missing 15:30:00-16:00:00")
####  16:00:00-16:30:00
p3 = my_missing(temp_1[1003:1961,],title = "Missing 16:00:00-16:30:00")
####  16:30:00-17:00:00
p4 = my_missing(temp_1[1962:2829,],title = "Missing 16:30:00-17:00:00")
####  17:00:00-17:30:00
p5 = my_missing(temp_1[2830:3708,],title = "Missing 17:00:00-17:30:00")
####  17:30:00-18:30:00
p6 = my_missing(temp_1[3709:4667,],title = "Missing 17:30:00-18:30:00")
####  18:30:00-19:00:00
p7 = my_missing(temp_1[4668:5461,],title = "Missing 18:30:00-19:00:00")
grid.arrange(p2,p3,p4,p5,p6, nrow = 1)


From this plot, we can observe a pattern in the missing values of the Twitter data. The time range is given in each title; each column represents a team and each row a time point. There is a concentration of missing values in the bottom 30% of the data. We set up the time points such that, across all 31 teams, each point has at least one tweet from some team; a missing value for the four teams shown here therefore means that only other teams tweeted at that time. We carried out extensive research online and checked our data repeatedly to make sure we had not made a mistake, but we did not find a valid explanation for this pattern.
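The concentration of missing values can be quantified as well as plotted. A short sketch, assuming the merged `temp_1` data frame built above:

```r
# Proportion of missing time points per team, over the whole time range
colMeans(is.na(temp_1[, c("76er", "spurs", "pro_warriours", "pro_lakers")]))
```

Comparing these per-team proportions against the same computation restricted to the last 30% of rows makes the "concentration at the bottom" claim explicit in numbers.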




4 Main Analysis (Exploratory Data Analysis)




In our report, we take a macro-to-micro approach. We start with a brief overview of the whole league, then narrow down to compare team-specific performances. Originally, we chose all 31 teams for the analysis in our presentation, but we realised that plotting 31 teams together makes the result impossible to interpret. We therefore decided to analyze the top 2 and bottom 2 teams in each region. Eventually, we zoom in to the analysis of individual players. Occasionally, we break this structure to provide a better visual comparison between seemingly random perspectives and explain how they are related.


4.1 Overview of the whole league




4.1.1 Total number of games played vs number of wins




#number of games played vs number of wins
df1 = clutch[,c('GP','W','team')]
df1= gather(df1,type,count,-team)
#df1$count <-  ifelse(df1$type =="W",df1$count*(-1),df1$count)
temp = df1[df1$type=='GP',]
new_levels=  as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
#df1 <- within(df1, team <- factor(team, levels=names(sort(count,  decreasing=TRUE))))
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
  geom_bar(stat="identity",position="identity")+
  scale_fill_manual(name="type of games",values = pal("five38"))+
  coord_flip()+ggtitle("number of games played (GP) v.s number of wins (W)")+
  geom_hline(yintercept=0)+
  ylab("number of games")+
  xlab("team name")+
  scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
  theme_scientific()

Unlike other leagues, the NBA does not have a fixed number of games for each team. It would therefore be pointless to compare team performance solely on the number of wins without considering the total matches played. From this simple plot, we can observe that WAS played the largest number of games while GSW played the smallest. Interestingly, these two teams suffice to demonstrate our initial point about absolute wins: WAS has the largest number of wins in the league, but the highest winning rate belongs to GSW despite its relatively small number of wins.
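The winning-rate comparison can be made explicit by computing W/GP directly. A short sketch on the same `clutch` data frame:

```r
library(dplyr)

# Winning rate = wins over games played, not absolute wins
win_rate <- clutch %>%
  mutate(win_pct = W / GP) %>%
  arrange(desc(win_pct)) %>%
  select(team, GP, W, win_pct)

head(win_rate)
```

Ranking by `win_pct` rather than `W` is what separates GSW (fewest games, highest rate) from WAS (most wins in absolute terms).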




4.1.2 Personal fouls (PF) and turnovers (TOV)




df1 = clutch[,c('PF','TOV','team')]
df1= gather(df1,type,count,-team)
df1$count <-  ifelse(df1$type =="PF",df1$count*(-1),df1$count)
temp = df1[df1$type=='TOV',]
new_levels=  as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
#df1 <- within(df1, team <- factor(team, levels=names(sort(count,  decreasing=TRUE))))
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
  geom_bar(stat="identity",position="identity")+
  scale_fill_manual(values = pal("five38"))+
  coord_flip()+ggtitle("Personal fouls (PF) and turnovers (TOV)")+
  geom_hline(yintercept=0)+
  ylab("counts")+
  xlab("team name")+
  scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
  theme_scientific()

We have already seen this graph above in the discussion of the rounding pattern. We bring it up again because of its relationship with the following plot on aggressiveness and defensiveness. From this plot, we can see that the teams toward the top of the graph have higher TOV and higher PF on average; there is a slight positive correlation between the two statistics.




4.1.3 Divergent plot on points decomposition




df1 = clutch[,c('PCT_PTS_2PT','PCT_PTS_3PT','PCT_PTS_FT','team')]
df1= gather(df1,type,count,-team)
temp =  df1[df1$type=='PCT_PTS_2PT',]
new_levels=  as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
df1$count <-  ifelse(df1$type =="PCT_PTS_2PT",df1$count*(-1),df1$count)

df1 %>% ggplot(aes(x=team, y=count, fill=type))+
  geom_col()+
  scale_fill_manual(values = pal("five38"))+
  coord_flip()+ggtitle("2PT%,3PT%,FT%")+
  geom_hline(yintercept=0)+
  ylab("percentage")+
  xlab("team name")+
  scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
  theme_scientific()

This plot gives a very direct visual presentation of the decomposition of points for each team. There is a fairly obvious negative relationship between 2PT and FT. A team like TOR has the highest percentage of points from 2PT and a very low percentage from 3PT, while CLE sits at the complete opposite end. We single out these two teams because they provide a good point of comparison in the following plots on team aggressiveness and defensiveness.




4.1.4 Scatterplot on aggressiveness and defensiveness




library(png)
library(ggplot2)
library(gridGraphics)
library(ggimage)

path = 'https://github.com/NiHaozheng/NBA-Visualization/blob/master/clutch_team/logo/'
#img <- "https://github.com/NiHaozheng/NBA-Visualization/blob/master/clutch_team/logo/ATL.png?raw=true"
df1 = clutch[,c('OFF_RATING','DEF_RATING','team')]
df1$img = paste(path,df1$team,'.png?raw=true',sep='')
ggplot(df1,aes(x=OFF_RATING,y=DEF_RATING))+geom_point()+
  scale_y_reverse()+geom_image(image = df1$img, size = .05)+
  theme_scientific()+
  xlab('offensive rating')+ylab('defensive rating')

In this part of the analysis, we examine the interaction between the previous three plots.

The scatter plot demonstrates how offensive or defensive each team is. We can observe that MIL is a very defensive team with a very low offensive rating, while BOS is a very offensive team with the highest offensive rating.

Teams like OKC, SAS and WAS have high ratings on both scales, an indication of strong performance in both defence and offence, which is the mark of a strong team. This is further supported by our plot on total matches played and number of wins: WAS has the highest number of absolute wins, and OKC and SAS have winning rates among the top 5.

We would expect an aggressive team to have a higher number of personal fouls. However, comparing the plot on personal fouls with the offensive ratings, there does not seem to be a direct relationship between them.

Finally, does how aggressive or defensive a team is affect the way it scores? The answer emerges from comparing the scoring decomposition plot with this aggressiveness-defensiveness plot. As mentioned before, TOR has the highest percentage of 2PT points and a very low percentage of 3PT, while CLE is the complete opposite. We can observe that TOR has a very high defensive rating while CLE has a very high offensive rating. One potential explanation is that the 3-pointer is viewed as a riskier, more offensive scoring method compared to the much safer 2-pointer.


4.1.5 Traditional measure: TSP vs. PTS




# Define FGA: Field Goal Attempt 
FGA = df_fgm$overall / df_fct$overall
# Define TSP: True shooting percent 
TSP = df_pts$overall/(2*(FGA+0.44*df_fta$overall))
df_pts['TSP'] = TSP
# Make a copy of df_pts
df_pts_v1 = df_pts
# Subset to remove all the NAs due to players that did not have a team or did not play in 2016
df_pts_v1_2 = df_pts_v1[!is.na(df_pts_v1$TSP),]

##==================================================================
#Plot on whole data, all teams 
p_TSP = ggplot(df_pts_v1_2)+
  geom_point(aes(overall,TSP,color = player_name),size = 1)+
  facet_wrap(~Team_Name)+
  labs(title = "TSP V.S PTS Facet on Team",x = 'Overall PTS', y='Overall TSP')
ggplotly(p_TSP)
TopLowTeam = c("Celtics","Cavaliers","Warriors","Spurs","Lakers","Suns","76ers","Nets")
TopLowP_TSP = df_pts_v1_2[df_pts_v1_2$Team_Name %in% TopLowTeam,]
TopLowP_TSP['Rank'] = ifelse(TopLowP_TSP$Team_Name %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
p_TSP = ggplot(TopLowP_TSP)+
  geom_point(aes(X5min_plusminus_5,TSP,color = player_name,shape=Rank),size = 2)+
  facet_wrap(~Team_Name)+
  labs(title = "TSP V.S PTS Facet on X5min_plusminus_5 Top4Last4",x = 'X5min_plusminus_5 PTS', y='X5min_plusminus_5 TSP')
ggplotly(p_TSP)


From a micro level, we can observe from the graph that in the top-4 teams, the best players take over the game in clutch time: LeBron James in the Cavaliers, Kyrie Irving in the Celtics, Kawhi Leonard in the Spurs, and Stephen Curry and Kevin Durant in the Warriors. The reason may be that coaches usually trust their best players, who therefore take most of the shots. It is worth noting, however, that some other good players on those teams perform exceptionally well in clutch time, for example Kyle Korver in the Cavaliers and Danny Green in the Spurs; perhaps they should take more of those shots.

From a macro level, we can see that strong teams like the Celtics and Spurs have a very high true shooting percentage, the traditional measure of team performance. Moreover, we noted earlier that the Spurs have a high 3PT ratio; that the rate is so high reflects the quality of the team members and contributes to the team's good performance.


Hence, in this part, we illustrated that we should not look at the traditional data or our data alone; we should integrate them. The Spurs' true shooting rate is good on its own, but coupled with their high rate of 3-point attempts and their aggressive style, it becomes even more telling.


Moreover, if we look at TSP alone, the 76ers appear to have a fairly decent performance. However, cross-referencing with their defensive strategy and high 2PT ratio, this figure is not as convincing. This is one example of how we can integrate the traditional data and the alternative data.
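For reference, the TSP computed in the code above follows the standard true shooting formula, TS% = PTS / (2 × (FGA + 0.44 × FTA)). A quick sanity check on a hypothetical stat line:

```r
# Standard true shooting percentage formula
true_shooting <- function(pts, fga, fta) pts / (2 * (fga + 0.44 * fta))

# Hypothetical line: 30 points on 20 field goal attempts and 10 free throw attempts
true_shooting(30, 20, 10)  # 30 / (2 * (20 + 4.4)) ~ 0.615
```

Note that in the chunk above FGA is recovered as FGM divided by field goal percentage, which is why `df_fgm$overall / df_fct$overall` appears in place of a raw attempts column.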




4.2 Team specific analysis




In this section, we zoom in on the top 2 and bottom 2 teams in both the east and west regions. Instead of analyzing traditional team statistics, we choose to look at team performance in clutch time. Unlike in other sports, the last few seconds of a basketball match can make a huge difference. Furthermore, compared to other sports, NBA players do not differ hugely in their performance in normal play. In clutch time, when every player is pushed to the limit, it is a true test of mental stability, stamina and skill, and differences in ability are amplified in the final few seconds. We therefore believe that analyzing clutch-time performance can give great insight into the performance of a team.




4.2.1 3PCT vs 3FGM




#Plot on Top4 Last4
df_3pct['df_3fgm_overall']=df_3fgm$overall
df_pct3_v1 =  df_3pct
df_pct3_v1_2 = df_pct3_v1[!is.na(df_3fgm$player_name),]

TopLowP_TSP = df_pct3_v1_2[df_pct3_v1_2$Team_Name %in% TopLowTeam,]
TopLowP_TSP['Rank'] = ifelse(TopLowP_TSP$Team_Name %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")

p_3FGM3PCT = ggplot(TopLowP_TSP)+
  geom_point(aes(df_3fgm_overall,overall,color = player_name,shape=Rank),size = 1)+
  facet_wrap(~Team_Name)+
  labs(title = "3pct_overall V.S 3fgm_overall Facet on Top4Last4",x = '3fgm_overall', y='3pct_overall')
ggplotly(p_3FGM3PCT)


This is a traditional method of analyzing team performance. The 3-pointer is an important way to score in basketball and has a dominant effect on the final result of a game. Consistent with our previous analysis, all top-4 teams have very high 3-point rates, and the rate is extremely high for the Spurs, which confirms our earlier finding.




### Team Average Overall fgm

##==================================================================
#Plot on All team

df_all$Team_Name.x = as.factor(df_all$Team_Name.x)
countorder = df_all %>% group_by(Team_Name.x) %>% summarize(av=mean(overall.x, na.rm=TRUE))

#df_all = merge(df_fgm,df_pct,by = "player_id",all=TRUE)
ggplot(countorder, aes(reorder(Team_Name.x,av),av)) + 
  geom_col(color = "tomato", fill = "orange", alpha = .2)+
  coord_flip()+
  theme_scientific()+
  labs(title = "Team Average Overall fgm",x = 'Team', y='Average Overall fgm')

##==================================================================
#Plot on Top4 Last4
TopLowP_TSP_1 = df_all[df_all$Team_Name.y %in% TopLowTeam,]
countorder = TopLowP_TSP_1 %>% group_by(Team_Name.x) %>% summarize(av=mean(overall.x, na.rm=TRUE))
countorder['Rank'] = ifelse(countorder$Team_Name.x %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
#countorder
countorder
## # A tibble: 8 x 3
##   Team_Name.x    av Rank 
##   <fct>       <dbl> <chr>
## 1 76ers        3.91 Down4
## 2 Cavaliers    4.23 Top4 
## 3 Celtics      4.70 Top4 
## 4 Lakers       4.56 Down4
## 5 Nets         3.35 Down4
## 6 Spurs        3.60 Top4 
## 7 Suns         3.85 Down4
## 8 Warriors     4.00 Top4
ggplot(countorder, aes(reorder(Team_Name.x,av),av,fill = Rank)) + 
  geom_col()+
  coord_flip()+
  theme_scientific()+
  labs(title = "Team Average Overall fgm",x = 'Team', y='Average Overall fgm')+
  scale_colour_colorblind("Rank",
                          labels=countorder$Rank)

Team average overall FGM is a very important traditional measure of team performance. We can observe that strong teams do tend to have higher FGM. The Spurs seem to be an outlier; however, combining this figure with our previous analysis of the Spurs' aggressiveness, their high 3-point ratio and their high success rate, the relatively low overall FGM is easily understood. This is another example of how we can link the various parts together to derive meaningful results.




### Coordinates plot

# average within group 3point


cbP = c("#999999", "#E69F00", "#56B4E9", "#009E73",
        "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

df_3fgm_sum = aggregate(df_3fgm[,3:12], list(df_3fgm$Team_Name), sum, na.rm = TRUE)
deno = df_3fgm/df_3pct[,1:13]
deno$player_name = df_3fgm$player_name
deno$player_id = df_3fgm$player_id
deno$Team_Name = df_3fgm$Team_Name
deno_modi = aggregate(deno[,3:12], list(deno$Team_Name), sum, na.rm = TRUE)
average3point = df_3fgm_sum/deno_modi
average3point$Group.1=deno_modi$Group.1
average3point[is.na(average3point)] = 0

TopLowTeam = c("Celtics","Cavaliers","Warriors","Spurs",
               "Lakers","Suns","76ers","Nets")
TopLow3point = average3point[average3point$Group.1 %in% TopLowTeam,]

RK = ifelse(TopLow3point$Group.1 %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
TopLow3point['TRk']= RK 
#TopLow3point
p1 = ggparcoord(data = TopLow3point,
                columns =2:7,
                mapping=aes(color=as.factor(Group.1),
                            linetype = as.factor(TRk)),
                scale = 'globalminmax'
                )+
  scale_linetype_discrete("Rank",
                          labels=TopLow3point$TRk)+
  #scale_color_discrete("Team",
  #                     labels=TopLow3point$Group.1)+
  geom_vline(xintercept = 0:6, color = "lightblue")+
  theme(axis.text.x=element_text(angle=90))+
  labs(title = "Average 3PT Last Xmin yDown Top4 V.S Low4",x = 'Indicator', y='Team Average')+
  scale_colour_colorblind("Team",
                       labels=TopLow3point$Group.1)

p2 = ggparcoord(data = TopLow3point,
                columns =c(2,8:11),
                mapping=aes(color=as.factor(Group.1),
                            linetype = as.factor(TRk)),
                scale = 'globalminmax'
                )+
  scale_linetype_discrete("Rank",
                          labels=TopLow3point$TRk)+
  #scale_color_discrete("Team",
  #                     labels=TopLow3point$Group.1)+
  geom_vline(xintercept = 0:6, color = "lightblue")+
  theme(axis.text.x=element_text(angle=90))+
  labs(title = "Average 3PT Last Xmin yDownorHiger Top4 V.S Low4",x = 'Indicator', y='Team Average')+
  scale_colour_colorblind("Team",
                       labels=TopLow3point$Group.1)
# average within group all point



cbP = c("#999999", "#E69F00", "#56B4E9", "#009E73",
        "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

df_fgm_sum = aggregate(df_fgm[,3:12], list(df_fgm$Team_Name), sum, na.rm = TRUE)
deno = df_fgm/df_pct[,1:13]
deno$player_name = df_fgm$player_name
deno$player_id = df_fgm$player_id
deno$Team_Name = df_fgm$Team_Name
deno_modi = aggregate(deno[,3:12], list(deno$Team_Name), sum, na.rm = TRUE)
averagepoint = df_fgm_sum/deno_modi
averagepoint$Group.1=deno_modi$Group.1
averagepoint[is.na(averagepoint)] = 0

TopLowTeam = c("Celtics","Cavaliers","Warriors","Spurs",
               "Lakers","Suns","76ers","Nets")
TopLowpoint = averagepoint[averagepoint$Group.1 %in% TopLowTeam,]

RK = ifelse(TopLowpoint$Group.1 %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")
TopLowpoint['TRk']= RK 
#averagepoint


p3 = ggparcoord(data = TopLowpoint,
                columns =2:7,
                mapping=aes(color=as.factor(Group.1),
                            linetype = as.factor(TRk)),
                scale = 'globalminmax'
                )+
  scale_linetype_discrete("Rank",
                          labels=TopLow3point$TRk)+
  #scale_color_discrete("Team",
  #                     labels=TopLow3point$Group.1)+
  geom_vline(xintercept = 0:6, color = "lightblue")+
  theme(axis.text.x=element_text(angle=90))+
  labs(title = "Average TotalPT Last Xmin yDown Top4 V.S Low4",x = 'Indicator', y='Team Average')+
  scale_colour_colorblind("Team",
                       labels=TopLowpoint$Group.1)

p4 = ggparcoord(data = TopLowpoint,
                columns =c(2,8:11),
                mapping=aes(color=as.factor(Group.1),
                            linetype = as.factor(TRk)),
                scale = 'globalminmax'
                )+
  scale_linetype_discrete("Rank",
                          labels=TopLow3point$TRk)+
  #scale_color_discrete("Team",
  #                     labels=TopLow3point$Group.1)+
  geom_vline(xintercept = 0:6, color = "lightblue")+
  theme(axis.text.x=element_text(angle=90))+
  labs(title = "Average TotalPT Last Xmin yDownorHiger Top4 V.S Low4",x = 'Indicator', y='Team Average')+
  scale_colour_colorblind("Team",
                       labels=TopLowpoint$Group.1)

grid.arrange(p1, p2, p3, p4, nrow = 2)




From this parallel coordinates plot we can observe that traditional performance measures in clutch time fail to give us a good indication. This did not meet our expectation; our original reasoning was too naive, ignoring why clutch time happens in the first place. When a strong team enters clutch time, it is usually because its major players are in bad shape that day; otherwise they would have finished the game in regular time. That is why clutch time fails to give a good indication.




4.2.2 Further analysis on 30s clutch time

##==================================================================
#Plot on ALL
df_pct['df_fgm_overall']=df_fgm$overall
df_pct_v1 =  df_pct
df_pct_v1_2 = df_pct_v1[!is.na(df_fgm$player_name),]


p_FGMPCT = ggplot(df_pct_v1_2)+
  geom_point(aes(df_fgm_overall,overall,color = player_name),size = 1)+
  facet_wrap(~Team_Name)+
  labs(title = "pct_overall V.S fgm_overall ",x = 'fgm_overall', y='pct_overall')
ggplotly(p_FGMPCT)
df_pct['df_fgm_overall']=df_fgm$X30sec_plusminus_5
df_pct_v1 =  df_pct
df_pct_v1_2 = df_pct_v1[!is.na(df_fgm$player_name),]

TopLowP_TSP = df_pct_v1_2[df_pct_v1_2$Team_Name %in% TopLowTeam,]
TopLowP_TSP['Rank'] = ifelse(TopLowP_TSP$Team_Name %in% c("Celtics","Cavaliers","Warriors","Spurs"), "Top4", "Down4")

p_FGMPCT = ggplot(TopLowP_TSP)+
  geom_point(aes(df_fgm_overall,X30sec_plusminus_5,color = player_name,shape=Rank),size = 2)+
  facet_wrap(~Team_Name)+
  labs(title = "pct_X30sec_plusminus_5 V.S fgm_X30sec_plusminus_5 Facet on Top4Last4",x = 'fgm_X30sec_plusminus_5', y='pct_X30sec_plusminus_5')
ggplotly(p_FGMPCT)


In this plot, we take a deeper look at the final 30 seconds when the score is tight. This situation differs from the one above: in the last 30 seconds, within a 5-point margin, anything can happen. This is the real clutch time, yet the common wisdom remains the same: give the ball to the best players. The interesting case is the Warriors, champions of the last season, whose two best players, Kevin Durant and Stephen Curry, both have very low pct and fgm compared to their normal statistics. This confirms our previous analysis: when a strong team enters clutch time, its star players are usually not performing well that day. However, Shaun Livingston, a player with more than 10 years' experience in the NBA, seems more productive in last-30-second clutch time. The same pattern can be found in the other top-4 teams: veterans such as Al Horford on the Celtics and Tony Parker on the Spurs usually perform better, and even though they are no longer among the best players on their teams, they can be the best in clutch time. Advice for coaches: give the ball to veterans, and adjust your strategy based on the actual performance of the players on that day.

4.2.3 3-point average 10-second-down figure plot

##==================================================================
#Plot on All Teams
averagepoint=averagepoint[2:31,]
averagepoint['abbr'] = df_name_team_abbr[,1]

average3point=average3point[2:31,]
average3point['abbr'] = df_name_team_abbr[,1]

path = 'https://github.com/NiHaozheng/NBA-Visualization/blob/master/clutch_team/logo/'
averagepoint$img = paste(path,averagepoint$abbr,'.png?raw=true',sep='')
average3point$img = paste(path,average3point$abbr,'.png?raw=true',sep='')


##==================================================================
#Plot on Top4 Last4
TopLowP_TSP_1 = averagepoint[averagepoint$Group.1 %in% TopLowTeam,]
TopLowP_TSP_2 = average3point[average3point$Group.1 %in% TopLowTeam,]

p3 = ggplot(TopLowP_TSP_1,aes(overall,X10sec_down_3))+
  geom_point()+
  geom_image(image = TopLowP_TSP_1$img,
             size = .05)+
  theme_scientific()+
  labs(title = "3pt Average 10sec_down_3 v.s. Overall TopDown4",x = 'Overall', y='X10sec_down_3')

p4 = ggplot(TopLowP_TSP_2,aes(overall,X10sec_down_3))+
  geom_point()+
  geom_image(image = TopLowP_TSP_2$img,
             size = .05)+
  theme_scientific()+
  labs(title = "Total Average  X10sec_down_3 v.s. Overall TopDown4",x = 'Overall', y='X10sec_down_3')
grid.arrange(p3, p4, nrow = 1)




Although the traditional method in general fails to give us the result we are looking for, the 3-point average performance in the last 10 seconds is highly correlated with the ranking of the team. This figure plot gives us a clear visual representation of the data. One potential reason is that strong teams usually have a deeper player pool, so they have a designated 3-point shooter for the final shot. This is why strong teams in general have a better last-10-second performance (even though their star players may not be in good shape, as we explained above).




4.3 Player specific analysis




As for individual players, we mainly cover shooting patterns and miss rates. These are covered in detail in our interactive components.


4.4 Miscellaneous plots without significant discoveries




During our analysis, we have produced a large number of plots and explored many different aspects. However, some of them did not yield meaningful patterns. We include them in this section simply to document the path we have taken.


### TSP VS PTS All Star

# Define FGA: Field Goal Attempt 
FGA = df_fgm$overall / df_fct$overall
# Define TSP: True shooting percent 
TSP = df_pts$overall/(2*(FGA+0.44*df_fta$overall))
df_pts['TSP'] = TSP
# Make a copy of df_pts
df_pts_v1 = df_pts
# Subset to remove all the NAs due to players that did not have a team or did not play in 2016
df_pts_v1_2 = df_pts_v1[!is.na(df_pts_v1$TSP),]
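As a quick sanity check on the TSP formula used above, here is a tiny worked example with a made-up stat line (the numbers are illustrative, not from our data set):

```r
# True shooting percentage: TSP = PTS / (2 * (FGA + 0.44 * FTA))
# Hypothetical stat line, for illustration only
pts <- 30   # points scored
fga <- 20   # field goal attempts
fta <- 10   # free throw attempts

tsp <- pts / (2 * (fga + 0.44 * fta))
round(tsp, 3)   # 0.615
```

A TSP of about 0.615 would be a strong shooting night; the 0.44 factor approximates how many free throw trips end a possession.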

##==================================================================
#Plot on whole data, all teams 

p_TSP_All = ggplot(df_pts_v1_2)+
  geom_point(aes(overall,TSP,color = player_name,shape = Team_Name),size = 2)+
  labs(title = "TSP V.S PTS All Star",x = 'Overall PTS', y='Overall TSP')
ggplotly(p_TSP_All)
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have
## 31. Consider specifying shapes manually if you must have them.
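The warning above can be silenced by supplying shapes manually, as it suggests. A minimal sketch (the toy data frame here is a stand-in for `df_pts_v1_2`, since R only defines point shape codes 0-25 and we have 31 teams):

```r
library(ggplot2)

# Stand-in data with 31 groups, mimicking the 31 team names
toy <- data.frame(
  x = rnorm(31), y = rnorm(31),
  team = factor(paste0("Team", 1:31))
)

# scale_shape_manual takes one shape code per factor level;
# with more than 26 levels the codes must repeat
p <- ggplot(toy, aes(x, y, shape = team)) +
  geom_point(size = 2) +
  scale_shape_manual(values = rep(0:25, length.out = 31))
```

Repeating shapes is still hard to read, which is why our later plots distinguish teams by faceting instead.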




### TSP VS PTS on X5min_plusminus_5

# Define FGA: Field Goal Attempt on X5min_plusminus_5
FGA = df_fgm$X5min_plusminus_5 / df_fct$X5min_plusminus_5
# Define TSP: True shooting percent 
TSP = df_pts$X5min_plusminus_5/(2*(FGA+0.44*df_fta$X5min_plusminus_5))
df_pts['TSP'] = TSP
# Make a copy of df_pts
df_pts_v1 = df_pts
# Subset to remove all the NAs due to players that did not have a team or did not play in 2016
df_pts_v1_2 = df_pts_v1[!is.na(df_pts_v1$TSP),]


p_TSP_All = ggplot(df_pts_v1_2)+
  geom_point(aes(X5min_plusminus_5,TSP,color = player_name,shape = Team_Name),size = 2)+
  labs(title = "TSP V.S PTS All Star",x = 'X5min_plusminus_5 PTS', y='X5min_plusminus_5 TSP')
ggplotly(p_TSP_All)




### 3pcts_overall VS 3fgm_overall

##==================================================================
#Plot on All Team

df_3pct['df_3fgm_overall']=df_3fgm$overall
df_pct3_v1 =  df_3pct
df_pct3_v1_2 = df_pct3_v1[!is.na(df_3fgm$player_name),]
p_3FGM3PCT_All = ggplot(df_pct3_v1_2)+
  geom_point(aes(df_3fgm_overall,overall,color = player_name,shape = Team_Name),size = 2)+
  labs(title = "3pct_overall V.S 3fgm_overall ",x = '3fgm_overall', y='3pct_overall')
ggplotly(p_3FGM3PCT_All)




### ftm_30sec_plusmiuns_5

##==================================================================
#Plot on All teams

df_fta['df_ftm_30sec_plusmiuns_5'] = df_ftm$X30sec_plusminus_5
df_fta_v1 =  df_fta
df_fta_v1_2 = df_fta_v1[!is.na(df_fta$player_name),]
p_fta_ftm = ggplot(df_fta_v1_2)+
  geom_point(aes(X30sec_plusminus_5,
                 df_ftm_30sec_plusmiuns_5,
                 color = player_name,
                 shape=Team_Name),
             size = 1.3,
             alpha=0.5,
            position = "jitter")+
  labs(title = "df_ftm_30sec_plusmiuns_5 V.S X30sec_plusminus_5 ",x = 'X30sec_plusminus_5', y='df_ftm_30sec_plusmiuns_5')
ggplotly(p_fta_ftm)




### 1min_down5 plot

#Plot on Top4 Last4
TopLowP_TSP_1 = df_pct[df_pct$Team_Name %in% TopLowTeam,]
ggplot()+
  geom_point(data =TopLowP_TSP_1,
             aes(x = X1min_down_5, y= overall),
             position = position_jitter(w = 0.01, h = 0.02),
             alpha = 0.5,
             size = 3)+
  facet_wrap(~Team_Name)+
  labs(title = "overall V.S X1min_down_5",
       x = 'X1min_down_5', 
       y='overall')

4.4.1 Pair plots

pairs(df_all[c("X10sec_down_3.x","X10sec_down_3.y","X30sec_down_3.x","X30sec_down_3.y")])

#df_all
pairs(df_all[c("X1min_down_5.x","X1min_down_5.y",
               "X3min._down_5.x","X3min._down_5.y",
               "X5min._down_5.x","X5min._down_5.y")])

#df_all
pairs(df_all[c("X30sec_plusminus_5.x","X30sec_plusminus_5.y",
               "X1min_plusminus_5.x","X1min_plusminus_5.y",
               "X3min_plusminus_5.x","X3min_plusminus_5.y")])

5 Executive Summary (Presentation-style)




5.1 Shiny shooting map

The target audience of our report is sports team managers and investors. We would like to focus on the general performance of each team and on how to devise strategies for the team based on our information. In the presentation, we first start with the following plot:

#number of games played vs number of wins
df1 = clutch[,c('GP','W','team')]
df1= gather(df1,type,count,-team)
#df1$count <-  ifelse(df1$type =="W",df1$count*(-1),df1$count)
temp = df1[df1$type=='GP',]
new_levels=  as.character(temp[order(temp$count),]$team)
df1$team = factor(df1$team,levels=new_levels)
#df1 <- within(df1, team <- factor(team, levels=names(sort(count,  decreasing=TRUE))))
df1 %>% ggplot(aes(x=team, y=count, fill=type))+
  geom_bar(stat="identity",position="identity")+
  scale_fill_manual(name="type of games",values = pal("five38"))+
  coord_flip()+ggtitle("number of games played (GP) v.s number of wins (W)")+
  geom_hline(yintercept=0)+
  ylab("number of games")+
  xlab("team name")+
  scale_y_continuous(breaks = pretty(df1$count),labels = abs(pretty(df1$count)))+
  theme_scientific()


This plot allows the executives to get a brief overview of where each team stands in the league.

Then, we shift our focus to the shooting patterns of individual players in our Shiny plot. The Shiny plot gives a direct visualization of a player's shooting pattern and miss rate. This allows managers to devise strategies for who should take the final shot, and from where, to maximize the winning rate. It also allows the opposing team to decide which opponent to block so that there is a higher chance of a missed shot. We illustrate with an example in the following part:

Let's take Kobe's shooting pattern in the 14-15 season as an example. The following graph displays all the 3-pointers Kobe attempted in the final minutes of games.
[Figure: hexagonal chart of Kobe's 3-point attempts]
The following graph displays the shots that were successfully made. We can see that Kobe actually missed more than half of his 3-pointers. More interestingly, Kobe seems more comfortable shooting from the right side of the court.
[Figure: hexagonal chart of Kobe's made 3-pointers]
As a basketball team manager, the most intuitive use is to look at the shooting patterns of all players on the team in order to decide which player should take the final shot. The manager can also advise Kobe to take the shot from the right side to maximise his success rate. As the manager of the opposing team, when Kobe is in control of the ball, he can ask the team to put more defence on the right side, since that is the region Kobe is most comfortable with.


5.2 Shiny Word Cloud

As a basketball manager, the team's public image is of high importance: it is directly related to the team's sponsorships and funding. We can observe some interesting public opinions that can help us devise advertising strategies.

5.2.1 Word Cloud on Warriors

[Figure: word cloud for the Warriors]
We can observe that for the Warriors, the most popular player is Stephen Curry and the most discussed opponent is the Spurs. With this in mind, the team manager can put more emphasis on branding Curry and give him more publicity to meet the demand from the public.

5.2.2 Word Cloud on Lakers


Lakers' fans seem to focus more on players at the moment. Kawhi Leonard, for example, appears perhaps because of a rumour about a possible trade between the Spurs and the Lakers involving him. Paul George is in the same situation because of his statement last off-season that he wanted to play for his hometown team, the Los Angeles Lakers. We can also see the high popularity of the word "trade" within the tweets. As a team manager, when a rumour among the public grows too strong, he needs to take action to make sure the rumour does not harm the team. Our word cloud offers a quick way to glimpse public opinion.

In conclusion, instead of coming up with a general conclusion for the executives, we feel that a flexible system that everyone can use easily to get the information they need is much more useful. An industry like the NBA is just too dynamic; it is almost impossible to come up with a conclusion that fits everyone. A half-tailored system therefore allows the executives to get their desired results most easily.


6 Interactive Component




We have created an interactive plot using Shiny. The program is inspired by the open-source ballr project. It automatically fetches data from stats.nba.com based on your selection. Our data only covers clutch time, the final few minutes of a match. We want to explore whether a player or a team exhibits any shooting pattern in the final minutes, which players have a high scoring rate, and which shooting regions they favour. We believe that understanding the scoring pattern and the distribution of shooting locations provides valuable information both to the teams themselves, who can leverage it to devise clutch-time strategies, and to their opponents, who can use it to counter them. We have included 3 plot formats: hexagonal, scatter, and heat map.




6.1 Shoot Map




6.1.1 Chart options


6.1.1.1 Hexagonal

Hexagonal charts use R’s hexbin package to bin shots into hexagonal regions. The size and opacity of each hexagon are proportional to the number of shots taken within that region, and the color of each hexagon represents your choice of metric.

There are two sliders to adjust the maximum hexagon size and the variability of sizes across hexagons; for example, here is the same Stephen Curry chart with larger hexagons, plotting points per shot as the color metric. Note that the color metric is not plotted at the individual-hexagon level but at the court-region level.
[Figure: hexagonal chart with larger hexagons]


6.1.1.2 Scatter

Scatter charts are the most straightforward option: they show the location of each individual shot, color-coded for makes and misses.
[Figure: scatter shot chart]

6.1.1.3 Heat map

Heat map charts use two-dimensional kernel density estimation to show the distribution of shot attempts across the court. Unsurprisingly, most shot attempts are taken in the restricted area near the basket.
[Figure: heat map shot chart]
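A minimal sketch of the underlying technique, using `MASS::kde2d` (shipped with base R) on simulated shot coordinates rather than our real data:

```r
library(MASS)  # for kde2d, two-dimensional kernel density estimation

set.seed(1)
# Simulated shot locations (court x/y in feet); illustrative only
x <- rnorm(500, mean = 0, sd = 8)
y <- rnorm(500, mean = 10, sd = 6)

# Estimate the 2-D density on a 50 x 50 grid
dens <- kde2d(x, y, n = 50)

# dens$z holds the density surface that the heat map visualises
image(dens$x, dens$y, dens$z, xlab = "court x", ylab = "court y")
```

In our app the same kind of density surface is drawn over a court diagram instead of a plain axis grid.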


6.1.2 Instructions


In order to use our plot:

1. Select the team and players at the top of the page.

2. Select the season.

3. Select the minutes remaining (we set this to 5 to analyse clutch time).

4. Select your choice of chart and its details.

5. Select the shot zones, shot angles, shot distance, and FG made/missed from the dropdown boxes.




6.1.3 Potential improvements




In our plot, we only use time as a filter to select the data. However, we believe it would be valuable to observe whether the shooting patterns of the players change over time. Hence, in the future, we may use time as another variable on the plot instead of only using it to select the data.

While trying to publish the website, we encountered some difficulties. The program runs perfectly locally, and we included all the packages we used in the code. Following the Professor's advice, we also posted a question on the blog for R users, but no meaningful solution emerged. One potential reason is that our program is too large to load: we keep getting time-out errors. We have double-checked with the Professor, and she confirmed that we can use the local version of our shoot map. The following code will run the program stored in our GitHub repository:

library(shiny)
runGitHub("shootmap", "nmy411")




6.2 Word Cloud




In order to visualise public opinion about a team and what fans are looking for from it, we have created a word cloud in Shiny. We crawled the data from the Twitter API and cleaned the text to make it into a usable format: we removed punctuation, numbers, and English stop words, and stripped whitespace. The word cloud is uploaded to https://haozheng1995.shinyapps.io/wordcloud/ . A screenshot of the interface is provided below:
[Figure: word cloud interface]
This is much easier to use compared to our shoot map. You only need to select the time horizon of the tweets you are interested in from the bar on the left side. If you hover the mouse over a word, it displays the number of times that word appeared. To emphasise the importance of the top opinions, and for the sake of clarity, we only coloured the top 5 words.
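A base-R sketch of the cleaning steps described above (the stop-word list and sample tweet are illustrative; our actual preprocessing ran in Python):

```r
# Remove punctuation, numbers and English stop words, and strip whitespace
clean_tweet <- function(x, stopwords = c("the", "a", "an", "at", "in", "is")) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)   # drop punctuation
  x <- gsub("[[:digit:]]", " ", x)   # drop numbers
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stopwords)]
  paste(words, collapse = " ")       # collapse leftover whitespace
}

clean_tweet("Curry hits 3 at the buzzer!!!")  # "curry hits buzzer"
```

The cleaned words are then counted and passed to the word cloud renderer.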


6.2.1 Potential improvements




The weakness of our word cloud is that we preprocessed the data with Python; the program is not able to retrieve data online by itself because the text data is so noisy. In the future, we may improve the program so that it can retrieve and clean the data within itself.


7 Conclusion




During our project, the scope we aimed for was too large. Despite narrowing our focus to clutch-time performance, we tried to cover too many aspects of the basketball match. This made our analysis too dispersed, lacking depth and a clear progression between the different parts. In the future, we may want to zoom in even deeper on a small part of the match and carry out an in-depth analysis of it, for example the players' final-10-second performance and the coaches' strategies under different situations.

We have also learnt that we should not be too obsessed with representing each team individually. Even having 8 colours for 8 different teams can be quite distracting in a plot. Next time, we may consider having only two colours, one for the top-4 teams and one for the bottom-4 teams. This may give a better visual presentation and may allow us to find some interesting patterns.

Despite the above limitations, we still believe our project has been a success. We have analysed data that has been overlooked by current analyses and established valuable connections with the traditional methods. This new approach can serve as an alternative indicator of performance when the traditional approach fails. Our presentation to the executives does not aim to convey a single solution; instead, we provide tools and strategies for coaches and managers to devise the strategies that fit their teams best.